Abstract

Mappings to structured output spaces 
(strings, trees, partitions, etc.) are typically learned
using extensions of classification algorithms to simple
graphical structures (e.g., linear chains) in which search and
parameter estimation can be performed ex- 
actly. Unfortunately, in many complex prob- 
lems, it is rare that exact search or parame- 
ter estimation is tractable. Instead of learn- 
ing exact models and searching via heuristic 
means, we embrace this difficulty and treat 
the structured output problem in terms of 
approximate search. We present a frame- 
work for learning as search optimization, and 
two parameter updates with convergence the- 
orems and bounds. Empirical evidence shows 
that our integrated approach to learning and 
decoding can outperform exact models at 
smaller computational cost. 

1. Introduction 

Many general techniques for learning and decoding 
with structured outputs are computationally demand- 
ing, are ill-suited for dealing with large data sets, 
and employ parameter optimization for an intractable 
search (decoding) problem. In some instances, such as 
syntactic parsing, efficient task-specific decoding algo- 
rithms have been developed, but, unfortunately, these 
are rarely applicable outside of one specific task. 

Rather than separating the learning problem from the 
decoding problem, we propose to consider these two 
aspects in an integrated manner. By doing so, we are 
able to learn model parameters appropriate for the 
search procedure, avoiding the need to heuristically 
combine an a priori unrelated learning technique and
search algorithm. After phrasing the learning problem
in terms of search, we present two online parameter up- 
date methods: a simple perceptron-style update and 
an approximate large margin update. We apply our 
model to two tasks: a simple syntactic chunking task 
for which exact search is possible (to allow for com- 
parison to exact learning and decoding methods) and 
a joint tagging/chunking task for which exact search 
is intractable. 

2. Previous Work 

Most previous work on the structured outputs problem 
extends standard classifiers to linear chains. Among 
these are maximum entropy Markov models and con- 
ditional random fields (?; ?); case-factor diagrams (?); 
sequential Gaussian process models (?); support vec- 
tor machines for structured outputs (?) and max- 
margin Markov models (?); and kernel dependency 
estimation models (?). These models learn distribu- 
tions or weights on simple graphs (typically linear 
chains). Probabilistic models are optimized by gra- 
dient descent on the log likelihood, which requires 
computable expectations of features across the struc- 
ture. Margin-based techniques are optimized by solv- 
ing a quadratic program (QP) whose constraints spec- 
ify that the best structure must be weighted higher 
than all other structures. Linear chain assumptions 
can reduce the exponentially-many constraints to a 
polynomial, but training remains computationally ex- 
pensive. 

Recent efforts to reduce this computational demand consider employing constraints requiring only that the correct output outweigh the k-best model hypotheses (?). Alternatively, an online algorithm in which only very small QPs are solved is also possible (?).

At the heart of all these algorithms, batch or online, 
likelihood- or margin-based, is the computation: 

$\hat{y} = \arg\max_{y \in \mathcal{Y}} f(x, y; w)$ (1)

Algo Search(problem, initial, enqueue)
  nodes ← MakeQueue(MakeNode(problem, initial))
  while nodes is not empty do
    node ← RemoveFront(nodes)
    if GoalTest(node) then return node
    next ← Operators(node)
    nodes ← enqueue(problem, nodes, next)
  end while
  return failure



This seemingly innocuous statement is necessary in all models, and "simply" computes the structure $\hat{y}$ from the set of all possible structures $\mathcal{Y}$ that maximizes some function $f$ on an input $x$, parametrized by a weight vector $w$. This computation is typically left unspecified, since it is "problem specific."

Unfortunately, this argmax computation is, in real problems with complex graphical structure, often intractable. Compounding this issue is that this best guess $\hat{y}$ is only one ingredient to the learning algorithms: likelihood-based models require feature expectations and the margin-based methods require either a k-best list of best $y$ or a marginal distribution across the graphical structure. One alternative that alleviates some of these issues is to use a perceptron algorithm, where only the argmax is required (?), but performance can be adversely affected by the fact that even the argmax cannot be computed exactly; see (?) for example.

3. Search Optimization 

We present the Learning as Search Optimization (LaSO) framework for predicting structured outputs. The idea is to delve into Eq (1), first to reduce the requirement that an algorithm needs to compute an argmax, and second to produce generic algorithms that can be applied to problems that are significantly more complex than the standard sequence labeling tasks on which the majority of prior work has focused.

3.1. Search 

The generic search problem is covered in great depth 
in any introductory AI book. Its importance stems 
from the intractability of computing the "best" solu- 
tion to many problems; instead, one must search for 
a "good" solution. Most AI texts contain a definition 
of the search problem and a general search algorithm; 
we work here with that from ? (?). A search problem is a structure containing four fields: STATES (the world of exploration), OPERATORS (transitions in the world), GOAL TEST (a subset of states) and PATH COST (computes the cost of a path).

One defines a general search algorithm given a search problem, an initial state and a "queuing function." The search algorithm will either fail (if it cannot find a goal state) or will return a path. Such an algorithm (Figure 1) operates by cycling through a queue, taking the first element off, testing it as a goal and, if it is not a goal, expanding it according to OPERATORS. Each node stores the path taken to get there and the cost of this path. The enqueue function places the expanded nodes, next, onto the queue according to some variable ordering that can yield depth-first, breadth-first, greedy, beam, hill-climbing, and A* search (among others). Since most search techniques can be described in this framework, we will treat it as fixed.

Figure 1. The generic search algorithm.
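To make the role of the queuing function concrete, the following is a minimal Python sketch of the generic search loop described above. It is an illustration under stated assumptions, not the implementation used in the paper: the names `generic_search`, `breadth_first` and `depth_first`, and the toy string-building problem, are hypothetical, and nodes are represented directly by their states.

```python
from collections import deque

def generic_search(initial, goal_test, operators, enqueue):
    """Generic search loop: the queuing function alone decides the strategy."""
    nodes = deque([initial])
    while nodes:
        node = nodes.popleft()            # take the first element off the queue
        if goal_test(node):
            return node
        successors = operators(node)      # expand the dequeued node
        nodes = enqueue(nodes, successors)
    return None                           # failure: the queue ran empty

def breadth_first(nodes, successors):
    # Expanded nodes are placed at the back of the queue.
    return deque(list(nodes) + list(successors))

def depth_first(nodes, successors):
    # Expanded nodes are placed at the front of the queue.
    return deque(list(successors) + list(nodes))

# Hypothetical toy problem: build the string "ab" one character at a time.
def goal_test(s):
    return s == "ab"

def operators(s):
    return [s + c for c in "ab"] if len(s) < 2 else []

print(generic_search("", goal_test, operators, breadth_first))
print(generic_search("", goal_test, operators, depth_first))
```

Only the `enqueue` argument changes between the two calls; the loop itself is the fixed part of the framework.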

3.2. Search Parameterization 

Given the search framework described, for a given task 
the search problem will be fixed, the initial state will 
be fixed and the generic search algorithm will be fixed. 
The only place left, therefore, for parameterization 
is in the enqueue function, whose job it is to essen- 
tially rank hypotheses on a queue. The goal of learn- 
ing, therefore, is to produce an enqueue function that 
places good hypotheses high on the queue and bad 
hypotheses low on the queue. In the case of optimal search, this means that we will find the optimal solution quickly; in the case of approximate search (in which we are most interested), this is the difference between finding a good solution or not.

In our model, we will assume that the enqueue function is based on two components: a path component $g$ and a heuristic component $h$, and that the score of a node will be given by $g + h$. This formulation includes A* search when $h$ is an admissible heuristic, heuristic search when $h$ is inadmissible, best-first search when $h$ is identically zero, and any variety of beam search when a queue is cut off at a particular point at each iteration. We will assume $h$ is given and that $g$ is a linear function of features of the input $x$ and the path to and including the current node, $n$: $g = w^\top \Phi(x, n)$, where $\Phi(\cdot, \cdot)$ is the vector of features.
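As a hedged illustration of this parameterization, the sketch below scores a node as a linear path term plus a heuristic and uses that score inside a beam-style enqueue function. The sparse dictionary representation, the function names, and the toy tag-count features are assumptions made for the example only.

```python
def dot(w, phi):
    """Inner product of a sparse weight dict and a sparse feature dict."""
    return sum(w.get(name, 0.0) * value for name, value in phi.items())

def node_score(w, features, h, x, node):
    """Score g + h, with the path component g = w . Phi(x, node)."""
    return dot(w, features(x, node)) + h(x, node)

def make_beam_enqueue(w, features, h, x, beam_size):
    """Build an enqueue function that ranks hypotheses and cuts the queue off."""
    def enqueue(nodes, successors):
        ranked = sorted(list(nodes) + list(successors),
                        key=lambda n: node_score(w, features, h, x, n),
                        reverse=True)
        return ranked[:beam_size]
    return enqueue

# Hypothetical toy features: how often each tag occurs on the path.
def tag_counts(x, node):
    counts = {}
    for tag in node:
        counts[tag] = counts.get(tag, 0.0) + 1.0
    return counts

def zero_heuristic(x, node):
    return 0.0   # h identically zero

w = {"N": 1.0, "V": -0.5}
enqueue = make_beam_enqueue(w, tag_counts, zero_heuristic, x=None, beam_size=2)
kept = enqueue([("N",), ("V",)], [("N", "N"), ("N", "V")])
print(kept)
```

With these weights the four candidates score 1.0, -0.5, 2.0 and 0.5, so a beam of two keeps `("N", "N")` and `("N",)`.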

3.3. Learning the Search Parameters 

The supervised learning problem in this search-based framework is to take a search problem, a heuristic function, and training data with the goal of producing a good weight vector $w$ for the path function $g$. As in standard structured output learning, we will assume that our training data consists of $N$-many pairs $(x^{(n)}, y^{(n)}) \in \mathcal{X} \times \mathcal{Y}$ that tell us for a given input $x^{(n)}$ what is the correct structured output $y^{(n)}$. We will make one more important monotonicity assumption: for any given node $s \in \mathcal{S}$ and an output $y \in \mathcal{Y}$, we can tell whether $s$ can or cannot lead to $y$. In the case that $s$ can lead to $y$, we refer to $s$ as "y-good."[1]

Algo Learn(problem, initial, enqueue, w, x, y)
  nodes ← MakeQueue(MakeNode(problem, initial))
  while nodes is not empty do
    node ← RemoveFront(nodes)
    if none of nodes ∪ {node} is y-good or
       GoalTest(node) and node is not y-good then
      sibs ← siblings(node, y)
      w ← update(w, x, sibs, node ∪ nodes)
      nodes ← MakeQueue(sibs)
    else
      if GoalTest(node) then return w
      next ← Operators(node)
      nodes ← enqueue(problem, nodes, next, w)
    end if
  end while

Figure 2. The generic search/learning algorithm.

The learning problem can thus be formulated as follows: we wish to find a weight vector $w$ such that: (1) the first goal state dequeued is y-good and (2) the queue always contains at least one y-good state. In this framework, we explore an online learning scenario, where learning is tightly entwined with the search procedure. From a pragmatic perspective, this makes sense: it is useless to the model to learn parameters for cases that it will never actually encounter. We propose a learning algorithm of the form shown in Figure 2. In this algorithm, we write siblings(node, y) to denote the set of y-good siblings of this node. This can be calculated recursively by back-tracing to the first y-good ancestor and then tracing forward through only y-good nodes to the same search depth as node (in tasks where there is a unique y-good search path, which is common, the sibling of a node is simply the appropriate initial segment of this path).

There are two changes to the search algorithm to facilitate learning (compare Figure 1 and Figure 2). The first change is that whenever we make an error (a non-y-good goal node is dequeued or none of the queue is y-good), we update the weight vector $w$. Secondly, when an error is made, instead of continuing along this bad search path, we clear the queue and insert all the correct moves we could have made.[2]

1 We assume that the loss we optimize is monotonic on a path in $\mathcal{S}$; in this paper, we only use 0/1 loss.

2 Performing parameter optimization within search resembles reinforcement learning without the confounding factor of "exploration." Early research in reinforcement learning focused on arbitrary input/output mappings (?), though this was not framed as search. Later, associative RL was introduced, where a context input (akin to our input $x$) was given to an RL algorithm (?; ?). Similar approaches attempt to predict value functions for generalization using techniques such as temporal difference (TD) or Q-learning (?; ?; ?). More recently, ? (?) applied RL techniques to solving combinatorial scheduling problems, but again focus on the standard TD($\lambda$) framework. These frameworks, however, are not explicitly tailored for supervised learning, and without the aid of our monotonicity assumption it is difficult to establish convergence and generalization proofs. Despite these differences, our search optimization framework clearly lies on the border between supervised learning and reinforcement learning, and further investigation may reveal interesting connections.

Note that this algorithm cannot fail (in the sense that it will always find a goal state). Aiming at a contradiction, suppose it were to fail; this would mean that nodes would have become empty. Since "Operators" will never return an empty set, this means that sibs must have been empty. But every node that is inserted into the queue is either itself y-good or has an ancestor that is y-good, so sibs could never have been empty. (There may be a complication with cyclic search spaces; in this case, both algorithms need to be augmented with some memory to avoid such loops, as is standard.)

3.4. Parameter Updates

We propose two methods for updating the model parameters. To facilitate discussion, we will refer to a problem as linearly separable if there exists a weight vector $w$ with $\|w\|_2 \le 1$ such that the search algorithm parameterized by $w$ (a) will not fail and (b) will return an optimal solution. Note that with this definition, linear separability is a joint property of the problem and the search algorithm: what is separable with exact search may not be separable with a heuristic search. In the case of linearly separable data, we define the margin as the maximal $\gamma$ such that the data remain separable when all y-good states are down-weighted by $\gamma$. In other words, $\gamma$ is the minimum over all decisions of $\max_{g,b} \left[ w^\top \Phi(x, g) - w^\top \Phi(x, b) \right]$, where $g$ is a y-good node and $b$ is a y-bad node.

Perceptron Updates. A simple perceptron-style update rule (?), given (w, x, sibs, nodes), is $w \leftarrow w + \Delta$, where:

$\Delta = \sum_{n \in sibs} \frac{\Phi(x, n)}{|sibs|} - \sum_{n \in nodes} \frac{\Phi(x, n)}{|nodes|}$ (2)

When an update is made, the feature vectors for the incorrect decisions are subtracted off, and the feature vectors for all possible correct decisions are added. Whenever |sibs| = |nodes| = 1, this looks exactly like the standard perceptron update. When there is only one sibling but many nodes, this resembles the gradient of the log likelihood for conditional models after approximating the "log sum exp" with Jensen's inequality to turn it into a simple sum. When there is more than one correct next hypothesis, this update rule resembles that used in multi-label or ranking variants of the perceptron (?). In that work, different "weighting" schemes are proposed, including, for instance, one that weights the nodes in the sums proportional to the loss suffered; such schemes are also possible in our framework, but space prohibits a discussion of them here. Based on this update, we can prove the following theorem:

Theorem 1. For any training sequence that is separable by a margin of size $\gamma$, using the perceptron update rule the number of errors made during training is bounded above by $R^2/\gamma^2$, where $R$ is a constant such that for all training instances $(x, y)$, for all nodes $n$ in the path to $y$ and all successors $m$ of $n$ (good or otherwise), $\|\Phi(x, n) - \Phi(x, m)\|_2 \le R$.

In the case of inseparable data, we follow ? (?) and define $D_{w,\gamma}$ as the least obtainable error with weights $w$ and margin $\gamma$ over the training data: $D_{w,\gamma} = \left[ \sum_s \left( \max\{0, \gamma - \rho_s\} \right)^2 \right]^{1/2}$, where the sum is over all states leading to a solution and $\rho_s$ is the empirical margin between the correct state and the hypothesized state $s$. Using this notation, we obtain two corollaries (proofs are direct adaptations of ? (?) and ? (?)):

Corollary 2. For any training sequence, the number of mistakes made by the training algorithm is bounded above by $\min_{w,\gamma} (R + D_{w,\gamma})^2 / \gamma^2$, where $R$ is as before.

Corollary 3. For any i.i.d. training sequence of length $n$ and any test example $(x, y)$, the probability of error on the test example is bounded above by $\frac{2}{n+1} \mathbb{E}\left\{ \min_{w,\gamma} (R + D_{w,\gamma})^2 / \gamma^2 \right\}$, where the expectation is taken over all $n + 1$ data points.
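The learning loop of Figure 2 combined with the perceptron-style update can be sketched in Python as follows. This is a minimal sketch under several assumptions that are not part of the paper: a toy word-tagging problem with a hypothetical three-tag label set, emission and transition indicator features, states represented as label prefixes (so the y-good path is unique and the sibling is simply the matching prefix of y), and no weight averaging.

```python
from collections import defaultdict

LABELS = ("D", "N", "V")   # hypothetical toy tag set

def phi(x, node):
    """Features of the path to node (a tuple of labels for a prefix of x)."""
    f = defaultdict(float)
    prev = "<s>"
    for word, lab in zip(x, node):
        f[("emit", word, lab)] += 1.0
        f[("trans", prev, lab)] += 1.0
        prev = lab
    return f

def score(w, x, node):
    return sum(w.get(k, 0.0) * v for k, v in phi(x, node).items())

def is_good(node, y):
    return tuple(y[:len(node)]) == node        # node is a prefix of y

def expand(w, x, node, nodes, beam):
    nxt = [node + (lab,) for lab in LABELS]
    return sorted(nodes + nxt, key=lambda n: score(w, x, n), reverse=True)[:beam]

def update(w, x, sibs, bad):
    """Perceptron-style step: add averaged good features, subtract averaged bad."""
    for group, sign in ((sibs, 1.0), (bad, -1.0)):
        for n in group:
            for k, v in phi(x, n).items():
                w[k] += sign * v / len(group)

def learn(w, x, y, beam=1):
    nodes = [()]                               # initial (empty) hypothesis
    while nodes:
        node, nodes = nodes[0], nodes[1:]
        goal = len(node) == len(x)
        none_good = not any(is_good(n, y) for n in [node] + nodes)
        if none_good or (goal and not is_good(node, y)):
            sibs = [tuple(y[:len(node)])]      # unique y-good path
            update(w, x, sibs, [node] + nodes)
            nodes = sibs                       # restart from the correct prefix
        elif goal:
            return w
        else:
            nodes = expand(w, x, node, nodes, beam)
    return w

def decode(w, x, beam=1):
    nodes = [()]
    while nodes:
        node, nodes = nodes[0], nodes[1:]
        if len(node) == len(x):
            return list(node)
        nodes = expand(w, x, node, nodes, beam)

data = [(["the", "dog", "barks"], ["D", "N", "V"]),
        (["dogs", "bark"], ["N", "V"])]
w = defaultdict(float)
for _ in range(3):                             # a few passes over the toy data
    for x, y in data:
        learn(w, x, y, beam=1)
print(decode(w, ["the", "dogs", "bark"]))
```

The same `expand` function (and hence the same beam) is used in `learn` and `decode`, so the weights are tuned for the search procedure that is actually run at test time.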

Approximate Large Margin Updates. One major disadvantage to the perceptron algorithm is that it only updates the weights when errors are made. This can lead to a brittle estimate of the parameters, in which the "good" states are weighted only minimally better than the "bad" states. We would like to enforce a large margin between the good states and bad states, but would like to do so without adding significant computational complexity to the problem. In the case of binary classification, ? (?) has presented ALMA, an online, approximate large margin algorithm that trains similarly to a perceptron. The primary difference (aside from a step size on the weight updates) is that updates are made if either (a) the algorithm makes a mistake or (b) the algorithm is close to making a mistake. Here, we adapt this algorithm to structured outputs in our framework.

Our algorithm, like ALMA, has four parameters: $\alpha$, $B$, $C$, $p$. $\alpha$ determines the degree of approximation required: for $\alpha = 1$, the algorithm seeks the true maximal margin solution; for $\alpha < 1$, it seeks one within $\alpha$ of the maximal. $B$ and $C$ can be seen as tuning parameters, but a default setting of $B = 1/\alpha$ and $C = \sqrt{2}$ is reasonable (see Theorem 4 below). We measure the instance vectors with norm $p$ and the weight vector with its dual value $q$ (where $1/p + 1/q = 1$). We use $p = q = 2$, but large $p$ produces sparser solutions, since the weight norm will approach 1. The update is:

$w \leftarrow \wp\left( w + C k^{-1/2} \wp(\Delta) \right)$ (3)

Here, $k$ is the "generation" number of the weight vector (initially 1 and incremented at every update) and $\wp(u)$ is the projection of $u$ into the $\ell_2$ unit sphere: $u / \max\{1, \|u\|_2\}$. One final change to the algorithm is to down-weight the score of all y-good nodes by $(1 - \alpha) B k^{-1/2}$. Thus, a good node will only survive if it is good by a large margin. This setup gives rise to a bound on the number of updates made (proof sketched in Appendix A) and two corollaries (proofs are nearly identical to Theorem 4 and (?)):

Theorem 4. For any training sequence that is separable by a margin of size $\gamma$, using the approximate large margin update rule with parameters $\alpha$, $B = \sqrt{8}/\alpha$, $C = \sqrt{2}$, the number of errors made during training is bounded above by $\frac{2}{\gamma^2} \left( \frac{2}{\alpha} - 1 \right)^2 + \frac{8}{\alpha} - 4$.

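Corollary 5. Suppose for a given $\alpha$, $B$ and $C$ are such that $C^2 + 2(1 - \alpha)BC = 1$; letting $\rho = (C\gamma)^{-2}$, the number of corrections made is bounded above by:

$\min_{w,\gamma} \; \frac{1}{\gamma} D_{w,\gamma} + \frac{\rho^2}{2} + \rho \left[ \frac{\rho^2}{4} + \frac{1}{\gamma} D_{w,\gamma} + 1 \right]^{1/2}$ (4)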



Corollary 6. For any i.i.d. training sequence of length $n$ and any test example $(x, y)$, the probability of error on the test example is bounded above by $\frac{2}{n+1} \mathbb{E}\{\cdot\}$, where $(\cdot)$ is given in Eq (4) and the expectation is taken over all $n + 1$ data points.
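The approximate large margin step can be illustrated with a short Python sketch for the case p = q = 2. It follows our reading of the update described above (project the update direction, take a step of size C/sqrt(k), project the result) and treats the down-weighting of y-good nodes as a subtraction from their scores; both readings, the dense list representation, and all function names are assumptions made for illustration.

```python
import math

def project(u):
    """Projection into the l2 unit sphere: u / max(1, ||u||_2)."""
    norm = math.sqrt(sum(v * v for v in u))
    scale = max(1.0, norm)
    return [v / scale for v in u]

def alma_update(w, delta, k, C=math.sqrt(2.0)):
    """One approximate large margin step with generation number k."""
    step = C / math.sqrt(k)
    d = project(delta)
    return project([wi + step * di for wi, di in zip(w, d)])

def good_node_discount(k, alpha, B):
    """Amount (1 - alpha) * B / sqrt(k) taken off the score of y-good nodes."""
    return (1.0 - alpha) * B / math.sqrt(k)

w1 = alma_update([0.0, 0.0], [3.0, 4.0], k=1)
print(w1)
print(good_node_discount(k=4, alpha=0.9, B=1.0 / 0.9))
```

Because of the outer projection the weight vector never leaves the unit ball, and the step size shrinks as the generation number k grows.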

4. Experiments 

4.1. Syntactic Chunking 

The syntactic chunking problem is a sequence segmen- 
tation and labeling problem; for example: 

[Great American]NP [said]VP [it]NP [increased]VP [its loan-loss reserves]NP [by]PP [$ 93 million]NP [after]PP [reviewing]VP [its loan portfolio]NP , [raising]VP [its total loan and real estate reserves]NP [to]PP [$ 217 million]NP .

Typical approaches to this problem recast it as a sequence labeling task and then solve it using any of the standard sequence labeling models; see (?) for a prototypical example using CRFs. The reduction to sequence labeling is typically done through the "BIO" encoding, where the beginning of an X phrase is tagged B-X, the non-beginning (inside) of an X phrase is tagged I-X and any word not in a phrase is tagged O (outside). More recently, ? (?) have described a straightforward extension to the CRF (called a Semi-CRF) in which the segmentation and labeling are done directly.
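The BIO reduction can be illustrated with a few lines of Python. The function name and the (label, words) chunk representation are hypothetical choices for this sketch; the input is a fragment of the example sentence above.

```python
def chunks_to_bio(chunks):
    """Convert a list of (label, words) chunks into per-word BIO tags.

    A label of None marks words that are outside any phrase.
    """
    tags = []
    for label, words in chunks:
        for i in range(len(words)):
            if label is None:
                tags.append("O")
            elif i == 0:
                tags.append("B-" + label)   # beginning of an X phrase
            else:
                tags.append("I-" + label)   # inside of an X phrase
    return tags

fragment = [("NP", ["its", "loan", "portfolio"]),
            (None, [","]),
            ("VP", ["raising"])]
print(chunks_to_bio(fragment))
```

For this fragment the tags are B-NP, I-NP, I-NP, O, B-VP: one label per word, which is what allows standard sequence labeling models to be applied.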

We explore similar models in the context of syntactic 
chunking, where entire chunks are hypothesized, and 
no reduction to word-based labels is made. We use 
the same set of features across all models, separated 
into "base features" and "meta features." The base 
features apply to words individually, while meta fea- 
tures apply to entire chunks. The base features we 
use are: the chunk length, the word (original, lower 
cased, stemmed, and original-stem), the case pattern 
of the word, the first and last 1, 2 and 3 characters, 
and the part of speech and its first character. We 
additionally consider membership features for lists of 
names, locations, abbreviations, stop words, etc. The meta features we use are, for any base feature b: b at position i (for any sub-position of the chunk), b before/after the chunk, the entire b-sequence in the chunk, and any 2- or 3-gram tuple of bs in the chunk. We use a first order Markov assumption (chunk label only depends on the most recent previous label) and all features are placed on labels, not on transitions. In this task, the arg max computation from Eq (1) is tractable; moreover, through a minor adaptation of the standard HMM forward and backward algorithms, we can compute feature expectations, which enable us to do training in a likelihood-based fashion.

Our search space is structured so that each state is 
the segmentation and labeling of an initial segment 
of the input string, and an operation extends a state 
by an entire labeled chunk (of any number of words).
For instance, on the example shown at the beginning 
of this section, the initial hypothesis would be empty; 
the first correct child would be to hypothesize a chunk 
of length 2 with the tag NP. The next correct hypoth- 
esis would be a chunk of length 1 with tag VP. This 
process would continue until the end of the sentence 
is reached. For beam search, we execute search as de- 
scribed, but after every expansion we only retain the 
b best hypotheses to continue on to the next round. 
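A hedged sketch of the operator for this search space: a state is represented as a tuple of (length, label) chunks covering an initial segment of the input, and each successor appends one more labeled chunk. The representation, the function name and the optional `max_len` cap are assumptions for illustration, not details given in the paper.

```python
def chunk_successors(state, n_words, labels, max_len=None):
    """Extend a state by one labeled chunk of any admissible length."""
    covered = sum(length for length, _ in state)
    remaining = n_words - covered
    limit = remaining if max_len is None else min(remaining, max_len)
    return [state + ((length, label),)
            for length in range(1, limit + 1)
            for label in labels]

# First expansion for a hypothetical three-word input with two labels.
children = chunk_successors((), 3, ("NP", "VP"))
print(len(children))
print(((2, "NP"),) in children)
```

A state that already covers the whole sentence has no successors, which is where the goal test applies; beam search then keeps only the b best of the generated children.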

Our models for this problem are denoted LaSOp_b and LaSOa_b (perceptron and approximate large margin updates, respectively), where b is the size of the beam we use in search, which we vary over {1, 5, 25, ∞}, where ∞ denotes full, exact Viterbi search and forward-backward updates similar to those used in the semi-CRF. This points out an important issue in our framework: if the graphical structure of the problem is amenable to exact search and exact updates, then the framework can accommodate this. In this case, for example, when using exact search, updates are only made at the end of decoding when the highest ranking output is incorrect (after adjusting the weights down for LaSOa), but other than this exception, the sum over the bad nodes in the updates is computed over the entire search lattice, and the updates are almost identical to those used in the conditional likelihood models for the gradient of the log normalization constant. We always use averaged weights.

Table 1. Results on syntactic chunking task; columns are training and testing time (h:m), and precision/recall/f-score on test data.

Model      Train    Test    Pre     Rec     F
...                                 94.6    94.4
LaSOa_∞    20:12    25      94.0    94.8    94.4

We report results on the CoNLL 2000 data set, which includes 8936 training sentences (212k words) and 2012 test sentences (47k words). We compare our proposed models against several baselines. The first baseline is denoted ZDJ02 and is the best system on this task to date (?). The second baseline is the likelihood-trained model, denoted SemiCRF. The third baseline is the standard structured perceptron algorithm, denoted Perceptron. We use 10% of the training data to tune model parameters: for the SemiCRF, this is the prior variance; for the online algorithms, this is the number of iterations to run (for ALMA, α = 0.9; changing α in the range [0.5, 1] does not affect the score by more than ±0.1 in all cases).

The results, in terms of training time, test decoding time, precision, recall and f-score are shown in Table 1. As we can see, the SemiCRF is by far the most computationally expensive algorithm, more than twice as slow to train as even the LaSOp_∞ algorithm. The Perceptron has roughly comparable training time to the exactly trained LaSO algorithms (slightly faster since it only updates for the best solution), but its performance falls short. Moreover, decoding time for the SemiCRF takes a half hour for the two thousand test sentences, while the greedy decoding takes only 52 seconds. It is interesting to note that at the larger beam sizes, the large margin algorithm is actually faster than the perceptron algorithm.

In terms of the quality of the output, the SemiCRF falls short of the previously reported results (92.2 versus 94.1 f-score). Our simplest model, LaSOp_1, already outperforms the SemiCRF with an f-score of 92.4; the large margin variant achieves 93.0. Increasing the beam past 5 does not seem to help with large margin updates, where performance only increases from 94.3 to 94.4 going from a beam of 5 to an infinite beam (at the cost of an extra 18 hours of training time).

4.2. Joint Tagging and Chunking 

In Section 4.1, we described an approach to chunking
based on search without reduction. This assumed that 
part of speech tagging had been performed as a pre- 
processing step. In this section, we discuss models in 
which part of speech tagging and chunking are per- 
formed jointly. This task has previously been used as 
a benchmark for factorized CRFs (?). In that work, 
the authors discuss many approximate inference meth- 
ods to deal with the fact that inference in such joint 
models is intractable. 

For this task, we do use the BIO encoding of the 
chunks so that a more direct comparison to the fac- 
torized CRFs would be possible. We use the same 
features as the last section, together with the regular 
expressions given by (?) (so that our feature set and 
their feature set are nearly identical). We do, however, 
omit their final feature, which is active whenever the 
part of speech at position i matches the most common 
part of speech assigned by Brill's tagger to the word 
at position i in a very large corpus of tagged data. 
This feature is somewhat unrealistic: the CoNLL data 
set is a small subset of the Penn Treebank, but the 
Brill tagger is trained on all of the Treebank. By us- 
ing this feature, we are, in effect, able to leverage the 
rest of the Treebank for part of speech tagging. Using 
just their features without the Brill feature, our per- 
formance is quite poor, so we added the lists described 
in the previous section. 

In this problem, states in our search space are again 
initial taggings of sentences (both part of speech tags 
and chunk tags), but the operators simply hypothe- 
size the part of speech and chunk tag for the single 
next word, with the obvious constraint that an I-X 
tag cannot follow anything but a B-X or I-X tag.
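The operator for the joint task, including the constraint on I-X tags, might be sketched as follows. States are assumed here to be tuples of (part of speech, chunk tag) pairs, one per word already tagged; the function names and the tiny tag inventories are hypothetical.

```python
def is_valid_next(prev_chunk_tag, chunk_tag):
    """An I-X tag may only follow a B-X or I-X tag of the same phrase type X."""
    if chunk_tag.startswith("I-"):
        phrase = chunk_tag[2:]
        return prev_chunk_tag in ("B-" + phrase, "I-" + phrase)
    return True

def joint_successors(state, pos_tags, chunk_tags):
    """Hypothesize the part of speech and chunk tag for the single next word."""
    prev = state[-1][1] if state else None
    return [state + ((pos, chunk),)
            for pos in pos_tags
            for chunk in chunk_tags
            if is_valid_next(prev, chunk)]

pos_tags = ("DT", "NN")
chunk_tags = ("B-NP", "I-NP", "O")
first = joint_successors((), pos_tags, chunk_tags)
second = joint_successors((("DT", "B-NP"),), pos_tags, chunk_tags)
print(len(first), len(second))
```

At the start of a sentence I-NP is ruled out, leaving four successors; after a B-NP all six combinations are available.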

The results are shown in Table 2. The models are compared against Sutton, the factorized CRF with tree reparameterization. We do not report on infinite beams, since such a calculation is intractable. We report training time,[3] testing time, tag accuracy, chunk accuracy and joint accuracy and f-score for chunking. In this table, we can see that the large margin algorithms are much faster to train than the perceptron (they require fewer iterations to converge, typically two or three compared to seven or eight). In terms of chunking f-score, none of the perceptron-style algorithms is able to outperform the Sutton model, but our LaSOa algorithms easily outperform it. With a beam of only 1, we achieve the same f-score (93.9) and with a beam of 10 we get an f-score of 94.4. Comparing Table 1 and Table 2, we see that, in general, we can do a better job chunking with the large margin algorithm when we do part of speech tagging simultaneously.

Table 2. Results on joint tagging/chunking task; columns are time to train (h:m), tag accuracy, chunk accuracy, joint accuracy and chunk f-score.

To verify Theorem 4 experimentally, we have run the same experiments using a 1000 sentence (25k word) subset of the training data (so that a positive margin could be found) with a beam of 5. On this data, LaSOa made 15932 corrections. The empirical margin at convergence was 1.299e-2; according to Theorem 4, the number of updates should have been ≤ 17724, which is borne out experimentally.
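The arithmetic behind this check can be reproduced approximately with the snippet below. It assumes the bound of Theorem 4 has the form (2/γ²)(2/α − 1)² + 8/α − 4 and that α = 0.9 was used in this experiment, as in the chunking experiments; neither the value of α for this run nor the unrounded margin is stated in the text, so the result is only expected to be close to the reported figure.

```python
def alma_update_bound(gamma, alpha):
    """Bound on the number of updates, under the assumed form of Theorem 4."""
    return (2.0 / gamma ** 2) * (2.0 / alpha - 1.0) ** 2 + 8.0 / alpha - 4.0

bound = alma_update_bound(gamma=1.299e-2, alpha=0.9)   # alpha is an assumption
observed = 15932                                       # corrections reported above
print(round(bound), observed <= bound)
```

With the margin rounded to four significant digits, the computed bound comes out at roughly 17,700, slightly below the reported 17724 (a gap consistent with rounding of the margin), and in either case above the 15932 corrections observed.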

4.3. Effect of Beam Size 

Clearly, from the results presented in the preceding 
sections, the beam size plays an important role in the 
modeling. In many problems, particularly with gen- 
erative models, training is done exactly, but decoding 
is done using an inexact search. In this paper, we 
have suggested that learning and decoding should be 
done in the same search framework, and in this sec- 
tion we briefly support this suggestion with empirical 
evidence. For our experiments, we use the joint tagging/chunking model from Section 4.2 and experiment by independently varying the beam size for training and the beam size for decoding. We show these results in Table 3, where the training beam size runs vertically and the decoding beam size runs horizontally; the numbers we report are the chunk f-score.

3 Sutton et al. report a training time of 13.6 hours on 5% of the data (400 sentences); it is unclear from their description how this scales. The scores reported from their model are, however, based on training on the full data set.

Table 3. Effect of beam size on performance; columns are for constant decoding beam; rows are for constant training beam. Numbers are chunk f-score on the joint task.

Training    Decoding beam
beam        1       5       10      25      50
1           93.9    92.8    91.9    91.3    90.9
5           90.5    94.3    94.4    94.1    94.1
10          89.6    94.3    94.4    94.2    94.2
25          88.7    94.2    94.5    94.3    94.3
50          88.4    94.2    94.4    94.2    94.4



In these results, we can see that the diagonal (same 
training beam size as testing beam size) is heavy, in- 
dicating that training and testing with the same beam 
size is useful. This difference is particularly strong 
when one of the sizes is 1 (i.e., pure greedy search is 
used). When training with a beam of one, decoding 
with a beam of 5 drops f-score from 93.9 (which is 
respectable) to 90.5 (which is poor). Similarly, when 
a beam of one is used for decoding, training with a 
beam of 5 drops performance from 93.9 to 92.8. The 
differences are less pronounced with beams > 10, but 
the trend is still evident. We believe (without proof) 
that when the beam size is large enough that the loss 
incurred due to search errors is at most the loss in- 
curred due to modeling errors, then using a different 
beam for training and testing is acceptable. However, 
when some amount of the loss is due to search errors, 
then a large part of the learning procedure is aimed 
at learning how to avoid search errors, not necessar- 
ily modeling the data. It is in these cases that it is 
important that the beam sizes match. 

5. Summary and Discussion 

In this paper, we have suggested that one view the learning with structured outputs problem as a search optimization problem and that the same search technique should be applied during both learning and decoding. We have presented two parameter update schemes in the LaSO framework, one perceptron-style and the other based on an approximate large margin scheme, both of which can be modified to work in kernel space or with alternative norms (but not both).

Our framework most closely resembles that used by the incremental parser of ? (?). There are, however, several differences between the two methodologies. Their model builds on standard perceptron-style updates (?) in which a full pass of decoding is done before any updates are made, and thus does not fit into the search optimization framework we have outlined. Collins and Roark found experimentally that stopping the parsing early whenever the correct solution falls out of the beam results in drastically improved performance. However, they had little theoretical justification for doing so. These "early updates" do strongly resemble our update strategy, with the difference that when Collins and Roark make an error, they stop decoding the current input and move on to the next; on the other hand, when our model makes an error, it continues from the correct solution(s). This choice is justified both theoretically and experimentally. On the tasks reported in this paper, we observe the same phenomenon: early updates are better than no early updates, and the search optimization framework is better than early updates. For instance, in the joint tagging/chunking task from Section 4.2 using a beam of 10, we achieved an f-score of 94.4 in our framework; using only early updates, this drops to 93.1 and using standard perceptron updates, it drops to 92.5.

Our work also bears a resemblance to training local 
classifiers and combining them together with global in- 
ference (?). The primary difference is that when learn- 
ing local classifiers, one must assume to have access to 
all possible decisions and must rank them according 
to some loss function. Alternatively, in our model, 
one only needs to consider alternatives that are in the 
queue at any given time, which gives us direct access 
to those aspects of the search problem that are eas- 
ily confused. This, in turn, resembles the online large 
margin algorithms proposed by ? (?), which suffer 
from the problem that the arg max must be computed 
exactly. Finally, one can also consider our framework 
in the context of game theory, where it resembles the 
iterated gradient ascent technique described by ? (?) 
and the closely related marginal best response frame- 
work (?). 

We believe that LaSO provides a powerful framework 
to learn to predict structured outputs. It enables one 
to build highly effective models of complex tasks ef- 
ficiently, without worrying about how to normalize a 
probability distribution, compute expectations, or es- 
timate marginals. It necessarily suffers against proba- 
bilistic models in that the output of the classifier will 
not be a probability; however, in problems with ex- 
ponential search spaces, normalizing a distribution is 
quite impractical. In this sense, it compares favor- 
ably with the energy-based models proposed by, for 
example, ? (?), which also avoid probabilistic nor- 
malization, but still require the exact computation 
of the arg max. We have applied the model to two 
comparatively trivial tasks: chunking and joint tag- 
ging/chunking. Since LaSO is not limited to problems 
with clean graphical structures, we believe that this 
framework will be appropriate for many other com- 
plex structured learning problems. 
Appendix A. Proof of Theorem 4

We follow ? (?), Thm 3, modifying the bound of the normalization factor when projecting $w$; suppose $w$ is the optimal separating hyperplane. Denoting the normalization factor $N_k$ on update $k$, we find: $N_{k+1}^2 \le \|w_k + \eta_k \Delta\|^2 \le \|w_k\|^2 + \eta_k^2 + 2 \eta_k w_k^\top \Delta \le 1 + \eta_k^2 + 2(1 - \alpha)\eta_k \gamma$ ($\gamma$ is the margin) by observing $\Delta$ is bounded above by $\gamma$ since $w^\top \left[ \sum_{n \in sibs} \Phi(x, n)/|sibs| - \sum_{n \in nodes} \Phi(x, n)/|nodes| \right] \le w^\top \left[ \max_{n \in sibs} \Phi(x, n) - \min_{n \in nodes} \Phi(x, n) \right] \le \gamma$, due to the definition of the margin. $N_k$ is bounded by $1 + (8/\alpha - 6)/k$ to bound the number of updates $m$ by $\gamma m \le (4/\alpha - 2)\sqrt{4/\alpha - 3 + m/2}$. Algebra completes the proof.



